Add opentelemetry to Armada #4973
Force-pushed from f0f80ca to 39e0cbb
Greptile Summary
This PR adds OpenTelemetry distributed tracing to all Armada services, wiring up OTLP exporters, gRPC/HTTP instrumentation, a span attribute policy processor, and a local dev stack with Jaeger + otel-collector.
Confidence Score: 3/5
Two concrete defects need fixing before merge: duplicate otelgrpc handlers on one code path produce double spans, and the cardinality guard will silently corrupt job-level trace attributes in any production-scale deployment.
The duplicate otelgrpc.NewClientHandler() registration in server.go + connection.go means every RPC from the server's internal API connection generates two client spans and doubles gRPC metrics.
The cardinality tracker caps all armada.* attributes at 1000 unique values; since job IDs are unique per job, traces will show [HIGH_CARDINALITY] for job-identifying attributes after the first thousand jobs, making the tracing essentially useless for its primary use case in production.
internal/server/server.go (duplicate stats handler), internal/common/observability/attribute_policy.go (cardinality exemption for armada.* attributes)
Sequence Diagram
%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
participant Svc as Armada Service
participant OTel as InitOTel()
participant TP as TracerProvider
participant OTLP as OTLP Exporter
participant Coll as OTel Collector
participant Jaeger as Jaeger UI
Svc->>OTel: InitOTel(cfg)
OTel->>OTLP: newTraceExporter(ctx, cfg)
OTLP-->>OTel: exporter (http or grpc)
OTel->>TP: NewTracerProvider(AttributePolicyProcessor, BatchSpanProcessor, Sampler)
OTel-->>Svc: otel.SetTracerProvider(tp)
Svc->>TP: tracer.Start(ctx, operation)
TP->>TP: SpanAttributePolicyProcessor.OnStart
TP-->>Svc: span
Svc->>TP: span.End()
TP->>TP: SpanAttributePolicyProcessor.OnEnd
TP->>OTLP: BatchSpanProcessor export
OTLP->>Coll: OTLP HTTP/gRPC
Coll->>Jaeger: OTLP gRPC
Svc->>OTel: ShutdownOTel(ctx)
OTel->>TP: tp.Shutdown(ctx)
Reviews (5): Last reviewed commit: "Implement otel and integrate with all se..."
Signed-off-by: Nikola Jokic <jokicnikola07@gmail.com>
Force-pushed from cc59894 to 1e11e3d
deniedContains: []string{"password", "secret", "token", "api_key", "apikey", "key"},
cardinalityExemptPrefixes: []string{"rpc.", "http.", "net.", "server.", "service."},
armada.* attributes not cardinality-exempt — job IDs will be silently replaced in production
armada. is in allowedPrefixes but absent from cardinalityExemptPrefixes. Any high-cardinality armada.* attribute — e.g. armada.job_id or armada.run_id, which are unique per submitted job — will hit the 1000-value cap after the first thousand distinct jobs and be replaced with [HIGH_CARDINALITY] for every subsequent job. In a production Armada cluster that processes more than 1000 jobs between restarts, all job-level trace attributes would become unreadable, directly undermining the observability goal of this PR. armada. should be added to cardinalityExemptPrefixes (or the limit for that prefix set much higher), and truly high-cardinality keys like armada.job_id should be listed explicitly in cardinalityExemptExact.
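A minimal sketch of the failure mode and the suggested fix, using only the standard library. It assumes the tracker counts distinct values per attribute key and checks exemptions by prefix; `cardinalityGuard`, `newCardinalityGuard`, `apply`, and `lastJobID` are hypothetical names, and the PR's actual tracker in attribute_policy.go may differ in detail.

```go
package main

import (
	"fmt"
	"strings"
)

// highCardinalityPlaceholder is the replacement value named in the review.
const highCardinalityPlaceholder = "[HIGH_CARDINALITY]"

// cardinalityGuard is a hypothetical, simplified stand-in for the PR's tracker.
// Assumption: distinct values are counted per attribute key.
type cardinalityGuard struct {
	limit          int
	exemptPrefixes []string
	seen           map[string]map[string]struct{} // key -> distinct values
}

func newCardinalityGuard(limit int, exemptPrefixes []string) *cardinalityGuard {
	return &cardinalityGuard{
		limit:          limit,
		exemptPrefixes: exemptPrefixes,
		seen:           map[string]map[string]struct{}{},
	}
}

// apply returns the value that would be recorded on the span for key.
func (g *cardinalityGuard) apply(key, value string) string {
	for _, p := range g.exemptPrefixes {
		if strings.HasPrefix(key, p) {
			return value // exempt keys are never capped
		}
	}
	values, ok := g.seen[key]
	if !ok {
		values = map[string]struct{}{}
		g.seen[key] = values
	}
	if _, known := values[value]; known {
		return value
	}
	if len(values) >= g.limit {
		return highCardinalityPlaceholder // cap reached: new values are masked
	}
	values[value] = struct{}{}
	return value
}

// lastJobID submits `jobs` unique job IDs and returns what the final one
// would look like on its span.
func lastJobID(g *cardinalityGuard, jobs int) string {
	out := ""
	for i := 0; i < jobs; i++ {
		out = g.apply("armada.job_id", fmt.Sprintf("job-%d", i))
	}
	return out
}

func main() {
	current := newCardinalityGuard(1000,
		[]string{"rpc.", "http.", "net.", "server.", "service."})
	withFix := newCardinalityGuard(1000,
		[]string{"rpc.", "http.", "net.", "server.", "service.", "armada."})

	fmt.Println(lastJobID(current, 1001)) // [HIGH_CARDINALITY]
	fmt.Println(lastJobID(withFix, 1001)) // job-1000
}
```

Under these assumptions, the 1001st distinct job ID is masked with the current prefix list and preserved once `armada.` is exempt. Listing only specific keys such as `armada.job_id` as exact exemptions would keep the cap in force for other armada.* attributes.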
What type of PR is this?
Enhancement
What this PR does / why we need it
Improve the observability of Armada services and their interactions.